fix: return only requested pages from get_page_content on Markdown by akhilesharora · Pull Request #280 · VectifyAI/PageIndex

akhilesharora · 2026-05-16T17:25:07Z

get_page_content's docstring says '3,8' means pages 3 and 8 (pageindex/retrieve.py:111-119). PDFs honor that. Markdown doesn't: _get_md_page_content takes min(page_nums)/max(page_nums) and returns everything in between.

Same input, before:

md  pages="5,100" -> [5, 10, 50, 100]
pdf pages="5,100" -> [5, 100]

After:

md  pages="5,100" -> [5, 100]
pdf pages="5,100" -> [5, 100]

_parse_pages already returns a discrete sorted list, so the simplest fix is to match against set(page_nums) instead of [min..max]:

wanted = set(page_nums)
...
if ln in wanted and ln not in seen:

Range form ('5-7') still parses to [5,6,7] so it keeps working. Only the comma-list shape changes.
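
For readers who want to see the two matching strategies side by side, here is a minimal sketch. It assumes the Markdown structure has been flattened to a list of heading line numbers. The names `parse_pages_sketch`, `select_range`, `select_discrete`, and the sample `headings` list are hypothetical illustrations, not the actual code in pageindex/retrieve.py.

```python
def parse_pages_sketch(spec):
    """Hypothetical stand-in for _parse_pages.

    '5,100' -> [5, 100]; '5-7' -> [5, 6, 7]; result is sorted and de-duplicated.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def select_range(line_nums, page_nums):
    """Old Markdown behavior: keep everything between min and max, inclusive."""
    lo, hi = min(page_nums), max(page_nums)
    return [ln for ln in line_nums if lo <= ln <= hi]


def select_discrete(line_nums, page_nums):
    """Fixed behavior: keep only the line numbers that were explicitly requested."""
    wanted = set(page_nums)
    seen = set()
    selected = []
    for ln in line_nums:
        if ln in wanted and ln not in seen:
            seen.add(ln)
            selected.append(ln)
    return selected


# Assumed sample data: line numbers of headings in a Markdown document.
headings = [1, 5, 10, 50, 100, 120]

print(select_range(headings, parse_pages_sketch("5,100")))     # over-collects
print(select_discrete(headings, parse_pages_sketch("5,100")))  # only 5 and 100
print(select_discrete(headings, parse_pages_sketch("5-7")))    # range form still works
```

Because the range form is expanded to discrete numbers before matching, set membership covers both input shapes without a separate code path.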

Added tests/test_retrieve_pages.py with four cases. Two of them fail on main and pass with the fix. The other two cover the range and single-page forms to make sure nothing regresses there.

$ python -m pytest tests/test_retrieve_pages.py -v
==================== 4 passed in 1.87s =====================================

Files touched:

pageindex/retrieve.py: 4 lines of logic, plus a docstring update so it matches the new behavior.
tests/test_retrieve_pages.py: new, self-contained, no LLM or PDF needed.

No PDF behavior change, no API surface change.

Closes #279

get_page_content's docstring describes '3,8' as two discrete pages. The PDF branch honors that. The Markdown branch was treating the list as the inclusive range [min..max] and pulling in every heading between them. Match against the parsed set instead so both branches agree. Range form '5-7' still parses to [5,6,7] so range queries are unchanged; only the comma-list shape changes behavior. Closes VectifyAI#279

akhilesharora wants to merge 1 commit into VectifyAI:main from akhilesharora:fix/md-discrete-pages-overcollection
